CONVERGENCE RATE OF LINEAR TWO-TIME-SCALE
STOCHASTIC APPROXIMATION

By Vijay R. Konda and John N. Tsitsiklis

We study the rate of convergence of linear two-time-scale stochas-
tic approximation methods. We consider two-time-scale linear itera- 
tions driven by i.i.d. noise, prove some results on their asymptotic co- 
variance and establish asymptotic normality. The well-known result 
[Polyak, B. T. (1990). Automat. Remote Contr. 51 937-946; Ruppert, 
D. (1988). Technical Report 781, Cornell Univ.] on the optimality of 
Polyak-Ruppert averaging techniques specialized to linear stochastic 
approximation is established as a consequence of the general results 
in this paper. 

1. Introduction. Two-time-scale stochastic approximation methods [Borkar 
(1997)] are recursive algorithms in which some of the components are up- 
dated using step-sizes that are very small compared to those of the remain- 
ing components. Over the past few years, several such algorithms have been 
proposed for various applications [Konda and Borkar (1999), Bhatnagar, 
Fu, Marcus and Fard (2001), Baras and Borkar (2000), Bhatnagar, Fu and 
Marcus (2001) and Konda and Tsitsiklis (2003)]. 

The general setting for two-time-scale algorithms is as follows. Let $f(\theta,r)$
and $g(\theta,r)$ be two unknown functions and let $(\theta^*,r^*)$ be the unique solution
to the equations

(1.1) $f(\theta,r) = 0, \qquad g(\theta,r) = 0.$

The functions $f(\cdot,\cdot)$ and $g(\cdot,\cdot)$ are accessible only by simulating or observing
a stochastic system which, given $\theta$ and $r$ as input, produces $F(\theta,r,V)$
and $G(\theta,r,W)$. Here, $V$ and $W$ are random variables, representing noise,
whose distribution satisfies

$f(\theta,r) = E[F(\theta,r,V)], \qquad g(\theta,r) = E[G(\theta,r,W)] \qquad \forall\,\theta,r.$

Assume that the noise $(V,W)$ in each simulation or observation of the
stochastic system is independent of the noise in all other simulations. In
other words, assume that we have access to an independent sequence of
functions $F(\cdot,\cdot,V_k)$ and $G(\cdot,\cdot,W_k)$. Suppose that for any given $\theta$, the stochastic
iteration

(1.2) $r_{k+1} = r_k + \gamma_k G(\theta, r_k, W_k)$

is known to converge to some $h(\theta)$. Furthermore, assume that the stochastic
iteration

(1.3) $\theta_{k+1} = \theta_k + \gamma_k F(\theta_k, h(\theta_k), V_k)$

is known to converge to $\theta^*$. Given this information, we wish to construct an
algorithm that solves the system of equations (1.1).

Note that the iteration (1.2) has only been assumed to converge when $\theta$ is
held fixed. This assumption allows us to fix $\theta$ at a current value $\theta_k$, run the
iteration (1.2) for a long time, so that $r_k$ becomes approximately equal to
$h(\theta_k)$, use the resulting $r_k$ to update $\theta_k$ in the direction of $F(\theta_k, r_k, V_k)$, and
repeat this procedure. While this is a sound approach, it requires an increasingly
large time between successive updates of $\theta_k$. Two-time-scale stochastic
approximation methods circumvent this difficulty by using different step
sizes $\{\beta_k\}$ and $\{\gamma_k\}$ and updating $\theta_k$ and $r_k$ according to

$\theta_{k+1} = \theta_k + \beta_k F(\theta_k, r_k, V_k),$

$r_{k+1} = r_k + \gamma_k G(\theta_k, r_k, W_k),$

where $\beta_k$ is very small relative to $\gamma_k$. This makes $\theta_k$ "quasi-static" compared
to $r_k$ and has an effect similar to fixing $\theta_k$ and running the iteration (1.2)
forever. In turn, $\theta_k$ sees $r_k$ as a close approximation of $h(\theta_k)$ and therefore
its update looks almost the same as (1.3).
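
The two-time-scale update just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `two_time_scale`, the scalar linear choices of `F` and `G` (whose common root is $(\theta^*, r^*) = (1, 2)$), the Gaussian noise, and the step sizes $\beta_k = 1/(k+2)$, $\gamma_k = (k+1)^{-0.6}$ are all hypothetical choices made for the example.

```python
import random


def two_time_scale(F, G, theta0, r0, beta, gamma, num_steps, noise_std=1.0, seed=0):
    """Run theta += beta_k * F(theta, r, V_k) and r += gamma_k * G(theta, r, W_k)."""
    rng = random.Random(seed)
    theta, r = theta0, r0
    for k in range(num_steps):
        v = rng.gauss(0.0, noise_std)  # noise seen by the slow update
        w = rng.gauss(0.0, noise_std)  # noise seen by the fast update
        # Both components are updated from the values at step k.
        theta, r = (theta + beta(k) * F(theta, r, v),
                    r + gamma(k) * G(theta, r, w))
    return theta, r


# Hypothetical scalar example: f(theta, r) = 4 - 2*theta - r, g(theta, r) = 3 - theta - r.
def F(theta, r, v):
    return 4.0 - 2.0 * theta - r + v


def G(theta, r, w):
    return 3.0 - theta - r + w


def beta(k):
    return 1.0 / (k + 2)        # slow step size


def gamma(k):
    return (k + 1) ** -0.6      # fast step size; beta(k) / gamma(k) -> 0


theta_hat, r_hat = two_time_scale(F, G, 0.0, 0.0, beta, gamma, 200_000)
print(theta_hat, r_hat)
```

Setting `noise_std=0.0` turns the same routine into a deterministic iteration, which is a convenient way to check that a given pair of step-size schedules is stable before adding noise.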

How small should the ratio $\beta_k/\gamma_k$ be for the above scheme to work? The
answer generally depends on the functions $f(\cdot,\cdot)$ and $g(\cdot,\cdot)$, which are typically
unknown. This leads us to consider a safe choice whereby $\beta_k/\gamma_k \to 0$.
The subject of this paper is the convergence rate analysis of the two-time-scale
algorithms that result from this choice. We note here that the analysis
is significantly different from the case where $\lim_k(\beta_k/\gamma_k) > 0$, which can be
handled using existing techniques.

Two-time-scale algorithms have been proved to converge in a variety of 
contexts [Borkar (1997), Konda and Borkar (1999) and Konda and Tsitsiklis 
(2003)]. However, except for the special case of Polyak-Ruppert averaging,
there are no results on their rate of convergence. The existing analysis
[Ruppert (1988), Polyak (1990), Polyak and Juditsky (1992) and Kushner and
Yang (1993)] of Polyak-Ruppert methods relies on special structure and is
not applicable to the more general two-time-scale iterations considered here.

The main result of this paper is a rule of thumb for calculating the asymptotic
covariance of linear two-time-scale stochastic iterations. For example,
consider the linear iterations

(1.4) $\theta_{k+1} = \theta_k + \beta_k(b_1 - A_{11}\theta_k - A_{12}r_k + V_k),$

(1.5) $r_{k+1} = r_k + \gamma_k(b_2 - A_{21}\theta_k - A_{22}r_k + W_k).$

We show that the asymptotic covariance matrix of $\beta_k^{-1/2}\theta_k$ is the same as
that of $\beta_k^{-1/2}\bar\theta_k$, where $\bar\theta_k$ evolves according to the single-time-scale
stochastic iteration:

$\bar\theta_{k+1} = \bar\theta_k + \beta_k(b_1 - A_{11}\bar\theta_k - A_{12}\bar r_k + V_k),$
$0 = b_2 - A_{21}\bar\theta_k - A_{22}\bar r_k + W_k.$

Besides the calculation of the asymptotic covariance of $\beta_k^{-1/2}\theta_k$ (Theorem 2.8),
we also establish that the distribution of $\beta_k^{-1/2}(\theta_k - \theta^*)$ converges to a
Gaussian with mean zero and with the above asymptotic covariance (Theorem 4.1).
We believe that the proof techniques of this paper can be extended
to nonlinear stochastic approximation to obtain similar results. However,
this and other possible extensions (such as weak convergence of paths to a
diffusion process) are not pursued in this paper.

In the linear case, our results also explain why Polyak-Ruppert averaging
is optimal. Suppose that we are looking for the solution of the linear system

$Ar = b$

in a setting where we only have access to noisy measurements of $b - Ar$. The
standard algorithm in this setting is

(1.6) $r_{k+1} = r_k + \gamma_k(b - Ar_k + W_k),$

and is known to converge under suitable conditions. (Here, $W_k$ represents
zero-mean noise at time $k$.) In order to improve the rate of convergence,
Polyak (1990) and Ruppert (1988) suggest using the average

(1.7) $\theta_k = \frac{1}{k}\sum_{l=0}^{k-1} r_l$

as an estimate of the solution, instead of $r_k$. It was shown in Polyak (1990)
that if $k\gamma_k \to \infty$, the asymptotic covariance of $\sqrt{k}\,\theta_k$ is $A^{-1}\Gamma(A')^{-1}$, where $\Gamma$
is the covariance of $W_k$. Furthermore, this asymptotic covariance matrix is
known to be optimal [Kushner and Yin (1997)].

The calculation of the asymptotic covariance in Polyak (1990) and Ruppert
(1988) uses the special averaging structure. We provide here an alternative
calculation based on our results. Note that $\theta_k$ satisfies the recursion

(1.8) $\theta_{k+1} = \theta_k + \frac{1}{k+1}(r_k - \theta_k),$

and the iteration (1.6)-(1.8) for $r_k$ and $\theta_k$ is a special case of the two-time-scale
iterations (1.4) and (1.5), with the correspondence $b_1 = 0$, $A_{11} = I$,
$A_{12} = -I$, $V_k = 0$, $b_2 = b$, $A_{21} = 0$, $A_{22} = A$. Furthermore, the assumption
$k\gamma_k \to \infty$ corresponds to our general assumption $\beta_k/\gamma_k \to 0$.

By applying our rule of thumb to the iteration (1.6)-(1.8), we see that
the asymptotic covariance of $\sqrt{k+1}\,\theta_k$ is the same as that of $\sqrt{k+1}\,\bar\theta_k$,
where $\bar\theta_k$ satisfies

$\bar\theta_{k+1} = \bar\theta_k + \frac{1}{k+1}\bigl(-\bar\theta_k + A^{-1}(b + W_k)\bigr),$

or

$\bar\theta_k = \frac{1}{k}\sum_{l=0}^{k-1}\bigl(A^{-1}b + A^{-1}W_l\bigr).$

It then follows that the covariance of $\sqrt{k}\,\bar\theta_k$ is $A^{-1}\Gamma(A')^{-1}$, and we recover
the result of Polyak (1990), Polyak and Juditsky (1992) and Ruppert (1988)
for the linear case.
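
As a small illustration of the averaging scheme (1.6)-(1.8), the sketch below runs the scalar version of the iteration and checks that the recursive form (1.8) reproduces the plain average (1.7). The function name `polyak_ruppert`, the values $a = 2$, $b = 1$, the Gaussian noise and the step size $\gamma_k = 0.5(k+1)^{-0.6}$ (so that $k\gamma_k \to \infty$) are assumptions made for the example only.

```python
import random


def polyak_ruppert(a, b, gamma, num_steps, noise_std=1.0, seed=0):
    """Scalar version of (1.6) with the running average (1.8)."""
    rng = random.Random(seed)
    r, theta = 0.0, 0.0
    history = []
    for k in range(num_steps):
        history.append(r)                       # r_k, kept only for the check below
        theta += (r - theta) / (k + 1)          # (1.8): theta_{k+1} from theta_k and r_k
        r += gamma(k) * (b - a * r + rng.gauss(0.0, noise_std))  # (1.6)
    return theta, history


def gamma(k):
    return 0.5 * (k + 1) ** -0.6   # hypothetical schedule with k * gamma_k -> infinity


theta, history = polyak_ruppert(a=2.0, b=1.0, gamma=gamma, num_steps=10_000)
direct_average = sum(history) / len(history)    # (1.7) computed directly
print(theta, direct_average)                    # both estimate b / a = 0.5
```

The averaged iterate plays the role of the slow component with $\beta_k = 1/(k+1)$, while $r_k$ is the fast component.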

In the example just discussed, the use of two time-scales is not necessary 
for convergence, but is essential for the improvement of the convergence rate. 
This idea of introducing two time-scales to improve the rate of convergence 
deserves further exploration. It is investigated to some extent in the context 
of reinforcement learning algorithms in Konda (2002). 

Finally, we would like to point out the differences between the two-time- 
scale iterations we study here and those that arise in the study of the tracking 
ability of adaptive algorithms [see Benveniste, Metivier and Priouret (1990)]. 
There, the slow component represents the movement of underlying system 
parameters and the fast component represents the user's algorithm. The fast 
component, that is, the user's algorithm, does not affect the slow compo- 
nent. In contrast, we consider iterations in which the fast component affects 
the slow one and vice versa. Furthermore, the relevant figures of merit are 
different. For example, in Benveniste, Metivier and Priouret (1990), one is 
mostly interested in the behavior of the fast component, whereas we focus 
on the asymptotic covariance of the slow component. 

The outline of the paper is as follows. In the next section, we consider 
linear iterations driven by i.i.d. noise and obtain expressions for the asymp- 
totic covariance of the iterates. In Section 3, we compare the convergence 
rate of two-time-scale algorithms and their single-time-scale counterparts. 
In Section 4, we establish asymptotic normality of the iterates. 

Before proceeding, we introduce some notation. Throughout the paper,
$|\cdot|$ represents the Euclidean norm of vectors or the induced operator norm
of matrices. Furthermore, $I$ and $0$ represent identity and null matrices,
respectively. We use the abbreviation w.p.1 for "with probability 1." We use
$c, c_1, c_2, \ldots$ to represent some constants whose values are not important.

2. Linear iterations. In this section, we consider iterations of the form

(2.1) $\theta_{k+1} = \theta_k + \beta_k(b_1 - A_{11}\theta_k - A_{12}r_k + V_k),$

(2.2) $r_{k+1} = r_k + \gamma_k(b_2 - A_{21}\theta_k - A_{22}r_k + W_k),$

where $\theta_k$ is in $\mathbb{R}^n$, $r_k$ is in $\mathbb{R}^m$, and $b_1$, $b_2$, $A_{11}$, $A_{12}$, $A_{21}$, $A_{22}$ are vectors
and matrices of appropriate dimensions.

Before we present our results, we motivate various assumptions that we 
will need. The first two assumptions are standard. 

Assumption 2.1. The random variables $(V_k, W_k)$, $k = 0, 1, \ldots$, are
independent of $r_0$, $\theta_0$, and of each other. They have zero mean and common
covariance

$E[V_kV_k'] = \Gamma_{11},$
$E[V_kW_k'] = \Gamma_{12} = \Gamma_{21}',$
$E[W_kW_k'] = \Gamma_{22}.$

Assumption 2.2. The step-size sequences $\{\gamma_k\}$ and $\{\beta_k\}$ are deterministic,
positive, nonincreasing, and satisfy the following:

1. $\sum_k\gamma_k = \sum_k\beta_k = \infty$.

The key assumption that the step sizes $\beta_k$ and $\gamma_k$ are of different orders
of magnitude is subsumed by the following.

Assumption 2.3. There exists some $\epsilon \ge 0$ such that

$\frac{\beta_k}{\gamma_k} \to \epsilon.$

For the iterations (2.1) and (2.2) to be consistent with the general scheme
of two-time-scale stochastic approximations described in the Introduction,
we need some assumptions on the matrices $A_{ij}$. In particular, we need
iteration (2.2) to converge to $A_{22}^{-1}(b_2 - A_{21}\theta)$, when $\theta_k$ is held constant at $\theta$.
Furthermore, the sequence $\theta_k$ generated by the iteration

$\theta_{k+1} = \theta_k + \beta_k\bigl(b_1 - A_{12}A_{22}^{-1}b_2 - (A_{11} - A_{12}A_{22}^{-1}A_{21})\theta_k + V_k\bigr),$

which is obtained by substituting $A_{22}^{-1}(b_2 - A_{21}\theta_k)$ for $r_k$ in iteration (2.1),
should also converge. Our next assumption is needed for the above convergence
to take place.

Let $\Delta$ be the matrix defined by

(2.3) $\Delta = A_{11} - A_{12}A_{22}^{-1}A_{21}.$

Recall that a square matrix $A$ is said to be Hurwitz if the real part of each
eigenvalue of $A$ is strictly negative.

Assumption 2.4. The matrices $-A_{22}$, $-\Delta$ are Hurwitz.
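
To make Assumption 2.4 concrete, the following sketch computes $\Delta$ from (2.3) and tests the Hurwitz property for the special case in which all four blocks are $2 \times 2$. It relies on the standard fact that a real $2 \times 2$ matrix has both eigenvalues in the open left half-plane exactly when its trace is negative and its determinant is positive. The helper names and the numerical matrices are hypothetical and chosen only for illustration.

```python
def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul_2x2(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def negate(m):
    return [[-entry for entry in row] for row in m]


def schur_complement(a11, a12, a21, a22):
    """Delta = A11 - A12 * A22^{-1} * A21, as in (2.3)."""
    correction = matmul_2x2(matmul_2x2(a12, inverse_2x2(a22)), a21)
    return [[a11[i][j] - correction[i][j] for j in range(2)] for i in range(2)]


def is_hurwitz_2x2(m):
    """Real 2x2 case: eigenvalues have negative real part iff trace < 0 and det > 0."""
    (a, b), (c, d) = m
    return (a + d) < 0 and (a * d - b * c) > 0


def satisfies_assumption_2_4(a11, a12, a21, a22):
    delta = schur_complement(a11, a12, a21, a22)
    return is_hurwitz_2x2(negate(a22)) and is_hurwitz_2x2(negate(delta))


# Hypothetical blocks.
A11 = [[2.0, 0.0], [0.0, 3.0]]
A12 = [[1.0, 0.0], [0.0, 1.0]]
A21 = [[1.0, 0.0], [1.0, 1.0]]
A22 = [[2.0, 1.0], [0.0, 1.0]]

delta = schur_complement(A11, A12, A21, A22)
print(delta, satisfies_assumption_2_4(A11, A12, A21, A22))
```

For larger blocks the same check would need a general eigenvalue routine; the trace/determinant shortcut is specific to the $2 \times 2$ case.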

It is not difficult to show that, under the above assumptions, $(\theta_k, r_k)$
converges in mean square and w.p.1 to $(\theta^*, r^*)$. The objective of this paper
is to capture the rate at which this convergence takes place. Obviously, this
rate depends on the step-sizes $\beta_k$, $\gamma_k$, and this dependence can be quite
complicated in general. The following assumption ensures that the rate of
mean square convergence of $(\theta_k, r_k)$ to $(\theta^*, r^*)$ bears a simple relationship
(asymptotically linear) with the step-sizes $\beta_k$, $\gamma_k$.

Assumption 2.5. 1. There exists a constant $\bar\beta \ge 0$ such that

$\lim_k\,(\beta_{k+1}^{-1} - \beta_k^{-1}) = \bar\beta.$

2. If $\epsilon = 0$, then

$\lim_k\,(\gamma_{k+1}^{-1} - \gamma_k^{-1}) = 0.$

3. The matrix $-(\Delta - \frac{\bar\beta}{2}I)$ is Hurwitz.

Note that when $\epsilon > 0$, the iterations (2.1) and (2.2) are essentially
single-time-scale algorithms and therefore can be analyzed using existing
techniques [Nevel'son and Has'minskii (1973), Kushner and Clark (1978),
Benveniste, Metivier and Priouret (1990), Duflo (1997) and Kushner and Yin (1997)].
We include this in our analysis as we would like to study the behavior of
the rate of convergence as $\epsilon \downarrow 0$. The following is an example of sequences
satisfying the above assumption with $\epsilon = 0$, $\bar\beta = 1/(\tau_1\beta_0)$:

$\gamma_k = \frac{\gamma_0}{(1 + k/\tau_0)^{\alpha}}, \qquad \frac{1}{2} < \alpha < 1,$

$\beta_k = \frac{\beta_0}{1 + k/\tau_1}.$
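
The properties required of such schedules are easy to check numerically. The sketch below assumes schedules of the form $\gamma_k = \gamma_0/(1+k/\tau_0)^{\alpha}$ and $\beta_k = \beta_0/(1+k/\tau_1)$ with hypothetical parameter values, and evaluates the three quantities appearing in Assumptions 2.3 and 2.5 at a large index $k$.

```python
def make_step_sizes(gamma0=1.0, tau0=1.0, alpha=0.75, beta0=0.5, tau1=2.0):
    """Return (beta, gamma) for the polynomial schedules described above."""
    if not 0.5 < alpha < 1.0:
        raise ValueError("alpha must lie in (1/2, 1)")

    def gamma(k):
        return gamma0 / (1.0 + k / tau0) ** alpha

    def beta(k):
        return beta0 / (1.0 + k / tau1)

    return beta, gamma


beta, gamma = make_step_sizes()
k = 10 ** 6
beta_bar = 1.0 / beta(k + 1) - 1.0 / beta(k)           # equals 1 / (tau1 * beta0)
gamma_increment = 1.0 / gamma(k + 1) - 1.0 / gamma(k)  # tends to 0 as k grows
ratio = beta(k) / gamma(k)                             # tends to 0, i.e. epsilon = 0
print(beta_bar, gamma_increment, ratio)
```

With these parameter values $\bar\beta = 1$, so part 3 of Assumption 2.5 asks that $-(\Delta - \frac{1}{2}I)$ be Hurwitz.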

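Let $\theta^* \in \mathbb{R}^n$ and $r^* \in \mathbb{R}^m$ be the unique solution to the system of linear
equations

$A_{11}\theta + A_{12}r = b_1,$

$A_{21}\theta + A_{22}r = b_2.$

For each $k$, let

(2.4) $\tilde\theta_k = \theta_k - \theta^*, \qquad \tilde r_k = r_k - A_{22}^{-1}(b_2 - A_{21}\theta_k)$

and

$\Sigma_{11}^k = \beta_k^{-1}E[\tilde\theta_k\tilde\theta_k'],$
$\Sigma_{12}^k = (\Sigma_{21}^k)' = \beta_k^{-1}E[\tilde\theta_k\tilde r_k'],$
$\Sigma_{22}^k = \gamma_k^{-1}E[\tilde r_k\tilde r_k'],$

$\Sigma^k = \begin{bmatrix} \Sigma_{11}^k & \Sigma_{12}^k \\ \Sigma_{21}^k & \Sigma_{22}^k \end{bmatrix}.$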

Our main result is the following. 



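Theorem 2.6. Under Assumptions 2.1-2.5, and when the constant $\epsilon$ of
Assumption 2.3 is sufficiently small, the limit matrices

(2.5) $\Sigma_{11}^{(\epsilon)} = \lim_k\Sigma_{11}^k, \qquad \Sigma_{12}^{(\epsilon)} = \lim_k\Sigma_{12}^k, \qquad \Sigma_{22}^{(\epsilon)} = \lim_k\Sigma_{22}^k$

exist. Furthermore, the matrix

$\Sigma^{(0)} = \begin{bmatrix} \Sigma_{11}^{(0)} & \Sigma_{12}^{(0)} \\ \Sigma_{21}^{(0)} & \Sigma_{22}^{(0)} \end{bmatrix}$

is the unique solution to the following system of equations

(2.6) $\Delta\Sigma_{11}^{(0)} + \Sigma_{11}^{(0)}\Delta' - \bar\beta\Sigma_{11}^{(0)} + A_{12}\Sigma_{21}^{(0)} + \Sigma_{12}^{(0)}A_{12}' = \Gamma_{11},$

(2.7) $A_{12}\Sigma_{22}^{(0)} + \Sigma_{12}^{(0)}A_{22}' = \Gamma_{12},$

(2.8) $A_{22}\Sigma_{22}^{(0)} + \Sigma_{22}^{(0)}A_{22}' = \Gamma_{22}.$

Finally,

(2.9) $\lim_{\epsilon\downarrow 0}\Sigma_{11}^{(\epsilon)} = \Sigma_{11}^{(0)}, \qquad \lim_{\epsilon\downarrow 0}\Sigma_{12}^{(\epsilon)} = \Sigma_{12}^{(0)}, \qquad \lim_{\epsilon\downarrow 0}\Sigma_{22}^{(\epsilon)} = \Sigma_{22}^{(0)}.$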
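
When $\theta_k$ and $r_k$ are scalars, the system (2.6)-(2.8) is triangular and can be solved by back-substitution: (2.8) gives $\Sigma_{22}^{(0)}$, then (2.7) gives $\Sigma_{12}^{(0)}$, then (2.6) gives $\Sigma_{11}^{(0)}$. The sketch below does this for hypothetical scalar coefficients; the function name and all numerical values are assumptions made for illustration.

```python
def solve_scalar_covariances(a11, a12, a21, a22, g11, g12, g22, beta_bar):
    """Solve (2.6)-(2.8) when every block is a scalar."""
    delta = a11 - a12 * a21 / a22
    # Scalar form of Assumptions 2.4 and 2.5(3): a22 > 0 and delta > beta_bar / 2.
    if a22 <= 0 or 2.0 * delta - beta_bar <= 0:
        raise ValueError("the Hurwitz assumptions fail for these scalars")
    s22 = g22 / (2.0 * a22)                                   # (2.8)
    s12 = (g12 - a12 * s22) / a22                             # (2.7)
    s11 = (g11 - 2.0 * a12 * s12) / (2.0 * delta - beta_bar)  # (2.6)
    return s11, s12, s22


# Hypothetical data: a11 = 2, a12 = a21 = a22 = 1, unit uncorrelated noise, beta_bar = 1.
s11, s12, s22 = solve_scalar_covariances(2.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, beta_bar=1.0)
print(s11, s12, s22)
```

In the matrix case (2.8) and (2.6) become Lyapunov equations and (2.7) a linear matrix equation, so a general solver would be needed in place of the divisions used here.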



Proof. Let us first consider the case $\epsilon = 0$. The idea of the proof is to
study the iteration in terms of transformed variables:

(2.10) $\hat\theta_k = \tilde\theta_k, \qquad \hat r_k = L_k\tilde\theta_k + \tilde r_k,$

for some sequence of $m \times n$ matrices $\{L_k\}$ which we will choose so that the
faster time-scale iteration does not involve the slower time-scale variables.
To see what the sequence $\{L_k\}$ should be, we rewrite the iterations (2.1)
and (2.2) in terms of the transformed variables as shown below (see Section A.1
for the algebra leading to these equations):

(2.11)
$\hat\theta_{k+1} = \hat\theta_k - \beta_k(B_{11}^k\hat\theta_k + A_{12}\hat r_k) + \beta_kV_k,$
$\hat r_{k+1} = \hat r_k - \gamma_k(B_{21}^k\hat\theta_k + B_{22}^k\hat r_k) + \gamma_kW_k + \beta_k(L_{k+1} + A_{22}^{-1}A_{21})V_k,$

where

$B_{11}^k = \Delta - A_{12}L_k,$

$B_{21}^k = \frac{L_k - L_{k+1}}{\gamma_k} + \frac{\beta_k}{\gamma_k}(L_{k+1} + A_{22}^{-1}A_{21})B_{11}^k - A_{22}L_k,$

$B_{22}^k = \frac{\beta_k}{\gamma_k}(L_{k+1} + A_{22}^{-1}A_{21})A_{12} + A_{22}.$

We wish to choose $\{L_k\}$ so that $B_{21}^k$ is eventually zero. To accomplish this,
we define the sequence of matrices $\{L_k\}$ by

(2.12)
$L_k = 0, \qquad 0 \le k \le k_0,$
$L_{k+1} = (L_k - \gamma_kA_{22}L_k + \beta_kA_{22}^{-1}A_{21}B_{11}^k)(I - \beta_kB_{11}^k)^{-1} \qquad \forall\,k \ge k_0,$

so that $B_{21}^k = 0$ for all $k \ge k_0$. For the above recursion to be meaningful,
we need $(I - \beta_kB_{11}^k)$ to be nonsingular for all $k \ge k_0$. This is handled
by Lemma A.1 in the Appendix, which shows that if $k_0$ is sufficiently large,
then the sequence of matrices $\{L_k\}$ is well defined and also converges to
zero.
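
The recursion (2.12) can be tried out in the scalar case. The sketch below iterates it with $k_0 = 0$, recomputes $B_{21}^k$ from its definition at every step to confirm that it vanishes up to rounding, and returns the final $L_k$, which should be small. The coefficients $a_{11} = 2$, $a_{12} = a_{21} = a_{22} = 1$ and the schedules $\beta_k = 0.5/(k+1)$, $\gamma_k = (k+1)^{-0.75}$ are hypothetical choices for which $1 - \beta_kB_{11}^k$ stays away from zero.

```python
def decoupling_gains(a11, a12, a21, a22, beta, gamma, num_steps):
    """Iterate the scalar form of (2.12) from L_0 = 0 (i.e. k_0 = 0)."""
    delta = a11 - a12 * a21 / a22
    ratio21 = a21 / a22                  # scalar A22^{-1} A21
    L = 0.0
    worst_b21 = 0.0
    for k in range(num_steps):
        b, g = beta(k), gamma(k)
        b11 = delta - a12 * L
        L_next = (L - g * a22 * L + b * ratio21 * b11) / (1.0 - b * b11)
        # B21 as defined after (2.11); it should vanish by construction.
        b21 = (L - L_next) / g + (b / g) * (L_next + ratio21) * b11 - a22 * L
        worst_b21 = max(worst_b21, abs(b21))
        L = L_next
    return L, worst_b21


def beta(k):
    return 0.5 / (k + 1)


def gamma(k):
    return (k + 1) ** -0.75


L_final, worst_b21 = decoupling_gains(2.0, 1.0, 1.0, 1.0, beta, gamma, 100_000)
print(L_final, worst_b21)
```

In this example $L_k$ decays roughly like $\beta_k/\gamma_k$, which is consistent with the time-varying "perturbation" interpretation of the transformation.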

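For every $k \ge k_0$, we define

$\hat\Sigma_{11}^k = \beta_k^{-1}E[\hat\theta_k\hat\theta_k'],$
$(\hat\Sigma_{21}^k)' = \hat\Sigma_{12}^k = \beta_k^{-1}E[\hat\theta_k\hat r_k'],$
$\hat\Sigma_{22}^k = \gamma_k^{-1}E[\hat r_k\hat r_k'].$

Using the transformation (2.10), it is easy to see that

$\hat\Sigma_{11}^k = \Sigma_{11}^k,$
$\hat\Sigma_{12}^k = \Sigma_{11}^kL_k' + \Sigma_{12}^k,$
$\hat\Sigma_{22}^k = \Sigma_{22}^k + \frac{\beta_k}{\gamma_k}\bigl(L_k\Sigma_{12}^k + \Sigma_{21}^kL_k' + L_k\Sigma_{11}^kL_k'\bigr).$

Since $L_k \to 0$, we obtain

$\lim_k\Sigma_{11}^k = \lim_k\hat\Sigma_{11}^k,$
$\lim_k\Sigma_{12}^k = \lim_k\hat\Sigma_{12}^k,$
$\lim_k\Sigma_{22}^k = \lim_k\hat\Sigma_{22}^k,$

provided that the limits exist.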

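To compute $\lim_k\hat\Sigma_{22}^k$, we use (2.11), the fact that $B_{21}^k = 0$ for large enough
$k$, the fact that $B_{22}^k$ converges to $A_{22}$, and some algebra, to arrive at the
following recursion for $\hat\Sigma_{22}^k$:

(2.13) $\hat\Sigma_{22}^{k+1} = \hat\Sigma_{22}^k + \gamma_k\bigl(\Gamma_{22} - A_{22}\hat\Sigma_{22}^k - \hat\Sigma_{22}^kA_{22}' + \delta_{22}^k(\hat\Sigma_{22}^k)\bigr),$

where $\delta_{22}^k(\cdot)$ is some matrix-valued affine function (on the space of matrices)
such that

$\lim_k\delta_{22}^k(\Sigma_{22}) = 0 \qquad$ for all $\Sigma_{22}$.

Since $-A_{22}$ is Hurwitz, it follows (see Lemma A.2 in the Appendix) that the
limit

$\lim_k\Sigma_{22}^k = \lim_k\hat\Sigma_{22}^k = \Sigma_{22}^{(0)}$

exists, and $\Sigma_{22}^{(0)}$ satisfies (2.8).
Similarly, $\hat\Sigma_{12}^k$ satisfies

(2.14) $\hat\Sigma_{12}^{k+1} = \hat\Sigma_{12}^k + \gamma_k\bigl(\Gamma_{12} - A_{12}\Sigma_{22}^{(0)} - \hat\Sigma_{12}^kA_{22}' + \delta_{12}^k(\hat\Sigma_{12}^k)\bigr),$

where, as before, $\delta_{12}^k(\cdot)$ is an affine function that goes to zero. (The coefficients
of this affine function depend, in general, on $\hat\Sigma_{22}^k$, but the important
property is that they tend to zero as $k \to \infty$.) Since $-A_{22}$ is Hurwitz, the
limit

$\lim_k\Sigma_{12}^k = \lim_k\hat\Sigma_{12}^k = \Sigma_{12}^{(0)}$

exists and satisfies (2.7). Finally, $\hat\Sigma_{11}^k$ satisfies

(2.15) $\hat\Sigma_{11}^{k+1} = \hat\Sigma_{11}^k + \beta_k\bigl(\Gamma_{11} - A_{12}\Sigma_{21}^{(0)} - \Sigma_{12}^{(0)}A_{12}' - \Delta\hat\Sigma_{11}^k - \hat\Sigma_{11}^k\Delta' + \bar\beta\hat\Sigma_{11}^k + \delta_{11}^k(\hat\Sigma_{11}^k)\bigr),$

where $\delta_{11}^k(\cdot)$ is some affine function that goes to zero. (Once more, the coefficients
of this affine function depend, in general, on $\hat\Sigma_{12}^k$ and $\hat\Sigma_{22}^k$, but they
tend to zero as $k \to \infty$.) Since $-(\Delta - \frac{\bar\beta}{2}I)$ is Hurwitz, the limit

$\lim_k\Sigma_{11}^k = \lim_k\hat\Sigma_{11}^k = \Sigma_{11}^{(0)}$

exists and satisfies (2.6).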

The above arguments show that for $\epsilon = 0$, the limit matrices in (2.5) exist
and satisfy (2.6)-(2.8). To complete the proof, we need to show that these
limit matrices exist for sufficiently small $\epsilon > 0$ and that the limiting relations
(2.9) hold. As this part of the proof uses standard techniques, we will only
outline the analysis.

Define for each $k$,

$Z_k = \begin{bmatrix} \tilde\theta_k \\ \tilde r_k \end{bmatrix}.$

The linear iterations (2.1) and (2.2) can be rewritten in terms of $Z_k$ as

$Z_{k+1} = Z_k - \beta_kB_kZ_k + \beta_kU_k,$

where $U_k$ is a sequence of independent random vectors and $\{B_k\}$ is a sequence
of deterministic matrices. Using the assumption that $\beta_k/\gamma_k$ converges
to $\epsilon$, it can be shown that the sequence of matrices $B_k$ converges to some
matrix $B^{(\epsilon)}$ and, similarly, that

$\lim_k E[U_kU_k'] = \Gamma^{(\epsilon)}$

for some matrix $\Gamma^{(\epsilon)}$. Furthermore, when $\epsilon > 0$ is sufficiently small, it can
be shown that $-(B^{(\epsilon)} - \frac{\bar\beta}{2}I)$ is Hurwitz. It then follows from standard
theorems [see, e.g., Polyak (1976)] on the asymptotic covariance of stochastic
approximation methods, that the limit

$\lim_k\beta_k^{-1}E[Z_kZ_k']$

exists and satisfies a linear equation whose coefficients depend smoothly on $\epsilon$
(the coefficients are infinitely differentiable w.r.t. $\epsilon$). Since the components of
the above limit matrix are $\Sigma_{11}^{(\epsilon)}$, $\Sigma_{12}^{(\epsilon)}$ and $\Sigma_{22}^{(\epsilon)}$ modulo some scaling, the latter
matrices also satisfy a linear equation which depends on $\epsilon$. The explicit form
of this equation is tedious to write down and does not provide any additional
insight for our purposes. We note, however, that when we set $\epsilon$ to zero, this
system of equations becomes the same as (2.6)-(2.8). Since (2.6)-(2.8) have
a unique solution, the system of equations for $\Sigma_{11}^{(\epsilon)}$, $\Sigma_{12}^{(\epsilon)}$ and $\Sigma_{22}^{(\epsilon)}$ also has
a unique solution for all sufficiently small $\epsilon$. Furthermore, the dependence
of the solution on $\epsilon$ is smooth because the coefficients are smooth in $\epsilon$. □

Remark 2.7. The transformations used in the above proof are inspired 
by those used to study singularly perturbed ordinary differential equations 
[Kokotovic (1984)]. However, most of these transformations were time-invariant 
because the perturbation parameter was constant. In such cases, the ma- 
trix L satisfies a static Riccati equation instead of the recursion (2.12). In 
contrast, our transformations are time-varying because our "perturbation"
parameter $\beta_k/\gamma_k$ is time-varying.

In most applications, the iterate $r_k$ corresponds to some auxiliary parameters
and one is mostly interested in the asymptotic covariance $\Sigma_{11}^{(0)}$ of $\theta_k$.
Note that according to Theorem 2.6, the covariance of the auxiliary parameters
is of the order of $\gamma_k$, whereas the covariance of $\theta_k$ is of the order of $\beta_k$.
With two time-scales, one can potentially improve the rate of convergence
of $\theta_k$ (compared with a single-time-scale algorithm) by sacrificing the rate of
convergence of the auxiliary parameters. To make such comparisons possible, we
need an alternative interpretation of $\Sigma_{11}^{(0)}$ that does not explicitly refer to
the system (2.6)-(2.8). This is accomplished by our next result, which provides
a useful tool for the design and analysis of two-time-scale stochastic
approximation methods.

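Theorem 2.8. The asymptotic covariance matrix of $\beta_k^{-1/2}\theta_k$ is the
same as the asymptotic covariance of $\beta_k^{-1/2}\bar\theta_k$, where $\bar\theta_k$ is generated by

(2.16) $\bar\theta_{k+1} = \bar\theta_k + \beta_k(b_1 - A_{11}\bar\theta_k - A_{12}\bar r_k + V_k),$

(2.17) $0 = b_2 - A_{21}\bar\theta_k - A_{22}\bar r_k + W_k.$

In other words,

$\Sigma_{11}^{(0)} = \lim_k\beta_k^{-1}E[\bar\theta_k\bar\theta_k'].$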

Proof. We start with (2.6)-(2.8) and perform some algebraic manipulations
to eliminate $\Sigma_{12}^{(0)}$ and $\Sigma_{22}^{(0)}$. This leads to a single equation for $\Sigma_{11}^{(0)}$,
of the form

$\Delta\Sigma_{11}^{(0)} + \Sigma_{11}^{(0)}\Delta' - \bar\beta\Sigma_{11}^{(0)} = \Gamma_{11} - A_{12}A_{22}^{-1}\Gamma_{21} - \Gamma_{12}(A_{22}')^{-1}A_{12}' + A_{12}A_{22}^{-1}\Gamma_{22}(A_{22}')^{-1}A_{12}'.$

Note that the right-hand side of the above equation is exactly the covariance
of $V_k - A_{12}A_{22}^{-1}W_k$. Therefore, the asymptotic covariance of $\theta_k$ is the same
as the asymptotic covariance of the following stochastic approximation:

$\bar\theta_{k+1} = \bar\theta_k + \beta_k(-\Delta\bar\theta_k + V_k - A_{12}A_{22}^{-1}W_k).$

Finally, note that the above iteration is the one obtained by eliminating $\bar r_k$
from iterations (2.16) and (2.17). □
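
For a scalar single-time-scale iteration of the form $x_{k+1} = x_k + \beta_k(-\Delta x_k + N_k)$ with i.i.d. zero-mean noise of variance $\sigma^2$, the second moment obeys the exact deterministic recursion $E[x_{k+1}^2] = (1 - \beta_k\Delta)^2E[x_k^2] + \beta_k^2\sigma^2$, so the scaled limit can be checked without any sampling. The sketch below does this for the hypothetical schedule $\beta_k = 1/(k+2)$ (for which $\bar\beta = 1$) and compares the result with $\sigma^2/(2\Delta - \bar\beta)$, the scalar solution of the single equation derived in the proof. The function name and the values of $\Delta$ and $\sigma^2$ are assumptions made for the example.

```python
def scaled_second_moment(delta, noise_var, beta, num_steps, m0=1.0):
    """Return E[x_k^2] / beta_k after num_steps of x += beta_k * (-delta * x + N_k)."""
    m = m0                                   # E[x_0^2]
    for k in range(num_steps):
        contraction = 1.0 - beta(k) * delta
        m = contraction ** 2 * m + beta(k) ** 2 * noise_var
    return m / beta(num_steps)


def beta(k):
    return 1.0 / (k + 2)     # beta_0 = 0.5, tau_1 = 2, hence beta_bar = 1


beta_bar = 1.0
noise_var = 2.0              # e.g. variance of V - W for unit, uncorrelated V and W
results = {}
for delta in (1.0, 2.0):
    value = scaled_second_moment(delta, noise_var, beta, 100_000)
    predicted = noise_var / (2.0 * delta - beta_bar)
    results[delta] = (value, predicted)
    print(delta, value, predicted)
```

The dependence on $\bar\beta$ is visible here: doubling $\Delta$ reduces the scaled limit from $\sigma^2$ to $\sigma^2/3$ rather than to $\sigma^2/2$.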

Remark. The single-time-scale stochastic approximation procedure in
Theorem 2.8 is not implementable when the matrices $A_{ij}$ are unknown. The
theorem establishes that two-time-scale stochastic approximation performs
as well as if these matrices were known.

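Remark. The results of the previous section show that the asymptotic
covariance matrix of $\beta_k^{-1/2}\theta_k$ is independent of the step-size schedule $\{\gamma_k\}$
for the fast iteration if

$\frac{\beta_k}{\gamma_k} \to 0.$

To understand, at least qualitatively, the effect of the step-sizes $\gamma_k$ on the
transient behavior, recall the recursions (2.13)-(2.15) satisfied by the covariance
matrices $\hat\Sigma^k$:

$\hat\Sigma_{11}^{k+1} = \hat\Sigma_{11}^k + \beta_k\bigl(\Gamma_{11} - A_{12}\Sigma_{21}^{(0)} - \Sigma_{12}^{(0)}A_{12}' - \Delta\hat\Sigma_{11}^k - \hat\Sigma_{11}^k\Delta' + \bar\beta\hat\Sigma_{11}^k + \delta_{11}^k(\hat\Sigma_{11}^k)\bigr),$

$\hat\Sigma_{12}^{k+1} = \hat\Sigma_{12}^k + \gamma_k\bigl(\Gamma_{12} - A_{12}\Sigma_{22}^{(0)} - \hat\Sigma_{12}^kA_{22}' + \delta_{12}^k(\hat\Sigma_{12}^k)\bigr),$

$\hat\Sigma_{22}^{k+1} = \hat\Sigma_{22}^k + \gamma_k\bigl(\Gamma_{22} - A_{22}\hat\Sigma_{22}^k - \hat\Sigma_{22}^kA_{22}' + \delta_{22}^k(\hat\Sigma_{22}^k)\bigr),$

where the $\delta_{ij}^k(\cdot)$ are affine functions that tend to zero as $k$ tends to infinity.
Using explicit calculations, it is easy to verify that the error terms $\delta_{ij}^k$ are of
the form

$\delta_{11}^k = A_{12}(\hat\Sigma_{21}^k - \Sigma_{21}^{(0)}) + (\hat\Sigma_{12}^k - \Sigma_{12}^{(0)})A_{12}' + O(\beta_k),$

$\delta_{12}^k = A_{12}(\hat\Sigma_{22}^k - \Sigma_{22}^{(0)}) + O\!\left(\frac{\beta_k}{\gamma_k}\right),$

$\delta_{22}^k = O\!\left(\frac{\beta_k}{\gamma_k}\right).$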
To clarify the meaning of the above relations, the first one states that the
affine function $\delta_{11}^k(\Sigma_{11})$ is the sum of the constant term
$A_{12}(\hat\Sigma_{21}^k - \Sigma_{21}^{(0)}) + (\hat\Sigma_{12}^k - \Sigma_{12}^{(0)})A_{12}'$ and another affine function of $\Sigma_{11}$
whose coefficients are proportional to $\beta_k$.

The above relations show that the rate at which $\hat\Sigma_{11}^k$ converges to $\Sigma_{11}^{(0)}$
depends on the rate at which $\hat\Sigma_{12}^k$ converges to $\Sigma_{12}^{(0)}$, through the term $\delta_{11}^k$.
The rate of convergence of $\hat\Sigma_{12}^k$, in turn, depends on that of $\hat\Sigma_{22}^k$, through the
term $\delta_{12}^k$. Since the step-size in the recursions for $\hat\Sigma_{22}^k$ and $\hat\Sigma_{12}^k$ is $\gamma_k$, and the
error terms in these recursions are proportional to $\beta_k/\gamma_k$, the transients depend
on both sequences $\{\gamma_k\}$ and $\{\beta_k/\gamma_k\}$. But each sequence has a different
effect. When $\gamma_k$ is large, instability or large oscillations of $r_k$ are possible.
On the other hand, when $\beta_k/\gamma_k$ is large, the error terms $\delta_{ij}^k$ can be large and
can prolong the transient period. Therefore, one would like to have $\beta_k/\gamma_k$
decrease to zero quickly, while at the same time avoiding large $\gamma_k$. Apart
from these loose guidelines, it appears difficult to obtain a characterization
of desirable step-size schedules.

3. Single time-scale versus two time-scales. In this section, we compare
the optimal asymptotic covariance of $\beta_k^{-1/2}\theta_k$ that can be obtained by a
realizable single-time-scale stochastic iteration, with the optimal asymptotic
covariance that can be obtained by a realizable two-time-scale stochastic
iteration. The optimization is to be carried out over a set of suitable gain
matrices that can be used to modify the algorithm, and the optimality criterion
to be used is one whereby a covariance matrix $\Sigma$ is preferable to another
covariance matrix $\bar\Sigma$ if $\bar\Sigma - \Sigma$ is nonzero and nonnegative definite.

Recall that Theorem 2.8 established that the asymptotic covariance of a 
two-time-scale iteration is the same as in a related single-time-scale itera- 
tion. However, the related single-time-scale iteration is unrealizable, unless 
the matrix A is known. In contrast, in this section we compare realizable it- 
erations that do not require explicit knowledge of A (although knowledge of 
A would be required in order to select the best possible realizable iteration). 

We now specify the classes of stochastic iterations that we will be com- 
paring. 

1. We consider two-time-scale iterations of the form

$\theta_{k+1} = \theta_k + \beta_kG_1(b_1 - A_{11}\theta_k - A_{12}r_k + V_k),$
$r_{k+1} = r_k + \gamma_k(b_2 - A_{21}\theta_k - A_{22}r_k + W_k).$

Here, $G_1$ is a gain matrix, which we are allowed to choose in a manner
that minimizes the asymptotic covariance of $\beta_k^{-1/2}\theta_k$.

2. We consider single-time-scale iterations, in which we have $\gamma_k = \beta_k$, but
in which we are allowed to use an arbitrary gain matrix $G$, in order to
minimize the asymptotic covariance of $\beta_k^{-1/2}\theta_k$. Concretely, we consider
iterations of the form

$\begin{bmatrix} \theta_{k+1} \\ r_{k+1} \end{bmatrix} = \begin{bmatrix} \theta_k \\ r_k \end{bmatrix} + \beta_kG\begin{bmatrix} b_1 - A_{11}\theta_k - A_{12}r_k + V_k \\ b_2 - A_{21}\theta_k - A_{22}r_k + W_k \end{bmatrix}.$

We then have the following result.

Theorem 3.1. Under Assumptions 2.1-2.5, and with $\epsilon = 0$, the minimal
possible asymptotic covariance of $\beta_k^{-1/2}\theta_k$, when the gain matrices $G_1$
and $G$ can be chosen freely, is the same for the two classes of stochastic
iterations described above.



Proof. The single-time-scale iteration is of the form

$Z_{k+1} = Z_k + \beta_kG(b - AZ_k + U_k),$

where

$Z_k = \begin{bmatrix} \theta_k \\ r_k \end{bmatrix}, \qquad U_k = \begin{bmatrix} V_k \\ W_k \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}, \qquad A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}.$

As is well known [Kushner and Yin (1997)], the optimal (in the sense of positive
definiteness) asymptotic covariance of $\beta_k^{-1/2}Z_k$ over all possible choices
of $G$ is the covariance of $A^{-1}U_k$. We note that the top block of $A^{-1}U_k$ is
equal to $\Delta^{-1}(V_k - A_{12}A_{22}^{-1}W_k)$. It then follows that the optimal asymptotic
covariance matrix of $\beta_k^{-1/2}\theta_k$ is the covariance of $\Delta^{-1}(V_k - A_{12}A_{22}^{-1}W_k)$.

For the two-time-scale iteration, Theorem 2.8 shows that for any choice
of $G_1$, the asymptotic covariance is the same as for the single-time-scale
iteration:

$\theta_{k+1} = \theta_k + \beta_kG_1(b_1 - \Delta\theta_k + V_k - A_{12}A_{22}^{-1}W_k).$

From this, it follows that the optimal asymptotic covariance of $\beta_k^{-1/2}\theta_k$ is
the covariance of $\Delta^{-1}(V_k - A_{12}A_{22}^{-1}W_k)$, which is the same as for
single-time-scale iterations. □
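
The key algebraic step in this proof, that the top block of $A^{-1}U_k$ equals $\Delta^{-1}(V_k - A_{12}A_{22}^{-1}W_k)$, can be checked directly when all blocks are scalars. The sketch below computes the variance of the first component of $A^{-1}U_k$ from the top row of the $2 \times 2$ inverse, and separately the variance of $\Delta^{-1}(V_k - (a_{12}/a_{22})W_k)$; the two function names and the numerical values are hypothetical.

```python
def optimal_variance_single_time_scale(a11, a12, a21, a22, g11, g12, g22):
    """Variance of the first component of A^{-1} U for scalar blocks."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("A is singular")
    top = (a22 / det, -a12 / det)            # top row of A^{-1}
    return top[0] ** 2 * g11 + 2.0 * top[0] * top[1] * g12 + top[1] ** 2 * g22


def optimal_variance_two_time_scale(a11, a12, a21, a22, g11, g12, g22):
    """Variance of Delta^{-1} (V - a12 / a22 * W) for scalar blocks."""
    delta = a11 - a12 * a21 / a22
    c = a12 / a22
    return (g11 - 2.0 * c * g12 + c * c * g22) / delta ** 2


params = dict(a11=3.0, a12=1.0, a21=2.0, a22=4.0, g11=1.0, g12=0.3, g22=2.0)
single = optimal_variance_single_time_scale(**params)
two = optimal_variance_two_time_scale(**params)
print(single, two)
```

Both expressions agree because $\det A = a_{22}\Delta$ in the scalar case, which is the scalar form of the block-inverse identity used in the proof.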

4. Asymptotic normality. In Section 2, we showed that $\beta_k^{-1}E[\hat\theta_k\hat\theta_k']$ converges
to $\Sigma_{11}^{(0)}$. The proof techniques used in that section do not extend easily
(without stronger assumptions) to the nonlinear case. For this reason, we develop
here a different result, namely, the asymptotic normality of $\hat\theta_k$, which
is easier to extend to the nonlinear case. In particular, we show that the
distribution of $\beta_k^{-1/2}\hat\theta_k$ converges to a zero-mean normal distribution with
covariance matrix $\Sigma_{11}^{(0)}$. The proof is similar to the one presented in Polyak
(1990) for stochastic approximation with averaging.

Theorem 4.1. If Assumptions 2.1-2.5 hold with $\epsilon = 0$, then $\beta_k^{-1/2}\hat\theta_k$
converges in distribution to $N(0, \Sigma_{11}^{(0)})$.

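Proof. Recall the iterations (2.11) in terms of transformed variables $\hat\theta$
and $\hat r$. Assuming that $k$ is large enough so that $B_{21}^k = 0$, these iterations can
be written as

$\hat\theta_{k+1} = (I - \beta_k\Delta)\hat\theta_k - \beta_kA_{12}\hat r_k + \beta_kV_k + \beta_k\delta_k^{(1)},$
$\hat r_{k+1} = (I - \gamma_kA_{22})\hat r_k + \gamma_kW_k + \beta_k\delta_k^{(2)} + \beta_k(L_{k+1} + A_{22}^{-1}A_{21})V_k,$

where $\delta_k^{(1)}$ and $\delta_k^{(2)}$ are given by

$\delta_k^{(1)} = A_{12}L_k\hat\theta_k,$

$\delta_k^{(2)} = -(L_{k+1} + A_{22}^{-1}A_{21})A_{12}\hat r_k.$

Using Theorem 2.6, $E[|\hat\theta_k|^2]/\beta_k$ and $E[|\hat r_k|^2]/\gamma_k$ are bounded, which implies
that

(4.1) $E[|\delta_k^{(1)}|^2] \le c\beta_k|L_k|^2, \qquad E[|\delta_k^{(2)}|^2] \le c\gamma_k,$

for some constant $c > 0$. Without loss of generality assume $k_0 = 0$ in (2.11).
For each $i$, define the sequences of matrices $\Theta_j^i$ and $R_j^i$, $j \ge i$, as

$\Theta_i^i = I,$
$\Theta_{j+1}^i = \Theta_j^i - \beta_j\Delta\Theta_j^i \qquad \forall\,j \ge i,$

$R_i^i = I,$
$R_{j+1}^i = R_j^i - \gamma_jA_{22}R_j^i \qquad \forall\,j \ge i.$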
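Using the above matrices, $\hat r_k$ and $\hat\theta_k$ can be rewritten as

(4.2) $\hat\theta_k = \Theta_k^0\hat\theta_0 - \sum_{i=0}^{k-1}\beta_i\Theta_k^iA_{12}\hat r_i + \sum_{i=0}^{k-1}\beta_i\Theta_k^iV_i + \sum_{i=0}^{k-1}\beta_i\Theta_k^i\delta_i^{(1)}$

and

(4.3) $\hat r_k = R_k^0\hat r_0 + \sum_{i=0}^{k-1}\gamma_iR_k^iW_i + \sum_{i=0}^{k-1}\beta_iR_k^i\delta_i^{(2)} + \sum_{i=0}^{k-1}\beta_iR_k^i(L_{i+1} + A_{22}^{-1}A_{21})V_i.$

Substituting the right-hand side of (4.3) for $\hat r_k$ in (4.2), and dividing by $\beta_k^{1/2}$,
we have

(4.4) $\beta_k^{-1/2}\hat\theta_k = \beta_k^{-1/2}\Theta_k^0\hat\theta_0 - \sum_{i=0}^{k-1}\beta_i\bar\Theta_k^iA_{12}\bigl(\beta_i^{-1/2}R_i^0\hat r_0\bigr) + \sum_{i=0}^{k-1}\beta_i\bar\Theta_k^i\bigl(\beta_i^{-1/2}\delta_i^{(1)}\bigr) + s_k^{(1)} + s_k^{(2)} + s_k^{(3)} + \sum_{i=0}^{k-1}\beta_i^{1/2}\bar\Theta_k^i\bigl(V_i - A_{12}A_{22}^{-1}W_i\bigr),$

where $\bar\Theta_k^i = (\beta_i/\beta_k)^{1/2}\Theta_k^i$ and

$s_k^{(1)} = -\sum_{i=0}^{k-1}\beta_i\bar\Theta_k^iA_{12}\Bigl(\beta_i^{-1/2}\sum_{j=0}^{i-1}\beta_jR_i^j\delta_j^{(2)}\Bigr),$

$s_k^{(2)} = -\sum_{i=0}^{k-1}\beta_i\bar\Theta_k^iA_{12}\Bigl(\beta_i^{-1/2}\sum_{j=0}^{i-1}\beta_jR_i^j(L_{j+1} + A_{22}^{-1}A_{21})V_j\Bigr),$

$s_k^{(3)} = -\beta_k^{-1/2}\Bigl(\sum_{i=0}^{k-1}\beta_i\Theta_k^iA_{12}\sum_{j=0}^{i-1}\gamma_jR_i^jW_j - \sum_{j=0}^{k-1}\beta_j\Theta_k^jA_{12}A_{22}^{-1}W_j\Bigr).$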

We wish to prove that the various terms in (4.4), with the exception of 
the last one, converge in probability to zero. Note that the last term is a 
martingale and therefore, can be handled by appealing to a central limit 
theorem for martingales. Some of the issues we encounter in the remainder 
of the proof are quite standard, and in such cases we will only provide an 
outline. 

To better handle each of the various terms in (4.4), we need approximations
of $\Theta_k^i$ and $R_k^i$. To do this, consider the nonlinear map $A \mapsto \exp(A)$
from square matrices to square matrices. A simple application of the inverse
function theorem shows that this map is a diffeomorphism (differentiable,
one-to-one with differentiable inverse) in a neighborhood of the origin. Let
us denote the inverse of $\exp(\cdot)$ by $\ln(\cdot)$. Since $\ln(\cdot)$ is differentiable around
$I = \exp(0)$, the function $\epsilon \mapsto \ln(I - \epsilon A)$ can be expanded into Taylor's series
for sufficiently small $\epsilon$ as follows:

$\ln(I - \epsilon A) = -\epsilon(A - E(\epsilon)),$

where $E(\epsilon)$ commutes with $A$ and $\lim_{\epsilon\to 0}E(\epsilon) = 0$. Assuming, without loss
of generality, that $\gamma_0$ and $\beta_0$ are small enough for the above approximation
to hold, we have for $k \ge 0$,

(4.5)
$\Theta_k^i = \exp\Bigl(-\sum_{j=i}^{k-1}\beta_j(\Delta - E_j^{(1)})\Bigr),$
$R_k^i = \exp\Bigl(-\sum_{j=i}^{k-1}\gamma_j(A_{22} - E_j^{(2)})\Bigr),$

for some sequence of matrices $\{E_k^{(i)}\}$, $i = 1, 2$, converging to zero. To obtain
a similar representation for $(\beta_i/\beta_k)^{1/2}\Theta_k^i$, note that Assumption 2.5(1) implies

(4.6) $\frac{\beta_k}{\beta_{k+1}} = 1 + \beta_k(\epsilon_k + \bar\beta),$

for some $\epsilon_k \to 0$. Therefore, using the fact that $1 + x = \exp(x(1 - o(x)))$
and (4.5), we have

(4.7) $(\beta_i/\beta_k)^{1/2}\Theta_k^i = \exp\Bigl(-\sum_{j=i}^{k-1}\beta_j\bigl(\Delta - \tfrac{\bar\beta}{2}I - E_j^{(3)}\bigr)\Bigr)$

for some sequence of matrices $E_k^{(3)}$ converging to zero. Furthermore, it is not
difficult to see that the matrices $E_k^{(i)}$, $i = 1, 2, 3$, commute with the matrices
$\Delta$, $A_{22}$ and $\Delta - (\bar\beta/2)I$, respectively. Since $-\Delta$, $-(\Delta - (\bar\beta/2)I)$ and $-A_{22}$ are
Hurwitz, using standard Lyapunov techniques we have for some constants
$c_1, c_2 > 0$,

(4.8)
$\max\bigl(|\Theta_k^i|,\ |(\beta_i/\beta_k)^{1/2}\Theta_k^i|\bigr) \le c_1\exp\Bigl(-c_2\sum_{j=i}^{k-1}\beta_j\Bigr),$
$|R_k^i| \le c_1\exp\Bigl(-c_2\sum_{j=i}^{k-1}\gamma_j\Bigr).$

Therefore it is easy to see that the first term in (4.4) goes to zero w.p.1. To
prove that the second term goes to zero w.p.1, note that
$\ln\beta_k \approx -\bar\beta\sum_{j=0}^{k-1}\beta_j$ [cf. (4.6)] and therefore for some $c_1, c_2 > 0$,

$|\beta_i^{-1/2}R_i^0\hat r_0| \le c_1\exp\Bigl(-c_2\sum_{j=0}^{i-1}\bigl(\gamma_j - \tfrac{\bar\beta}{2}\beta_j\bigr)\Bigr),$

which goes to zero as $i \to \infty$ (Assumption 2.3). Therefore, it follows from
Lemma A.3 that the second term also converges to zero w.p.1. Using (4.1)
and Lemma A.3, it is easy to see that the third term in (4.4) converges in
the mean (i.e., in $L_1$) to zero. Next, consider $E[|s_k^{(1)}|]$. Using (4.1), we have
for some positive constants $c_1$, $c_2$ and $c_3$,

$E\Bigl[\Bigl|\beta_i^{-1/2}\sum_{j=0}^{i-1}\beta_jR_i^j\delta_j^{(2)}\Bigr|\Bigr] \le c_1\sum_{j=0}^{i-1}\gamma_j\exp\Bigl(-\sum_{l=j}^{i-1}(c_2\gamma_l - c_3\beta_l)\Bigr)\sqrt{\frac{\beta_j}{\gamma_j}}.$

Since $\beta_j/\gamma_j \to 0$, Lemma A.3 implies that $s_k^{(1)}$ converges in the mean to
zero. To study $s_k^{(2)}$, consider

$E\Bigl[\Bigl|\beta_i^{-1/2}\sum_{j=0}^{i-1}\beta_jR_i^j(L_{j+1} + A_{22}^{-1}A_{21})V_j\Bigr|^2\Bigr].$

Since the $V_k$ are zero mean i.i.d., the above term is bounded above by

$c_1\sum_{j=0}^{i-1}\gamma_j\,\frac{\beta_j}{\gamma_j}\exp\Bigl(-\sum_{l=j}^{i-1}(c_2\gamma_l - c_3\beta_l)\Bigr)$

for some constants $c_1$, $c_2$ and $c_3$. Lemma A.3 implies that $s_k^{(2)}$ converges
in the mean to zero. Finally, consider $s_k^{(3)}$. By interchanging the order of
summation, it can be rewritten as



(4.9) 



fc-l 

j=0 



fc-i 



P. 



i=3 



Since $-A_{22}$ is Hurwitz, we have

$$A_{22}^{-1} = \int_0^\infty \exp(-A_{22}t)\,dt,$$

and we can rewrite the term inside the brackets in (4.9) as



fc-i 

t=j 



fan 



tk-\ 



Efe-1 



exp(-^4 22 t) 



/ fc-l N 

A12A22 exp - 7i^22 



We consider each of these terms separately. To analyze the first term, we wish
to obtain an "exponential" representation for $\gamma_j\beta_i/(\beta_j\gamma_i)$. It is not difficult to
see from Assumptions 2.5 (1) and (2) that

$$\frac{\beta_{k+1}}{\gamma_{k+1}} = \frac{\beta_k}{\gamma_k}(1 - \varepsilon_k\gamma_k) = \frac{\beta_k}{\gamma_k}\exp\big(-\varepsilon_k\gamma_k + O(\varepsilon_k^2\gamma_k^2)\big),$$

where $\varepsilon_k \to 0$. Therefore, using (4.5) and the mean value theorem, we have

7jA , 



< ci sup fei + — ) f V 7; ) exp f c 2 V (e L + 

;>A 7/7 Vft / V U K 



Pi 

ll 



ll 



which in turn implies, along with Lemma A.4 (with $p = 1$) and Assumption 2.3,
that the first term is bounded in norm by $c\sup_{l\ge j}(\varepsilon_l + \beta_l/\gamma_l)$ for
some constant $c > 0$. The second term is the difference between an integral
and its Riemann approximation and therefore is bounded in norm by
$c\sup_{l\ge j}\gamma_l$ for some constant $c > 0$. Finally, since $-A_{22}$ is Hurwitz, the norm
of the third term is bounded above by

$$c_1\exp\Big(-c_2\sum_{i=j}^{k-1}\gamma_i\Big)$$

for some constants $c_1, c_2 > 0$. An explicit computation of $E[|S_i^{(3)}|^2]$, using
the fact that $(V_k, W_k)$ is zero-mean i.i.d., and an application of Lemma A.3
shows that $S_i^{(3)}$ converges to zero in the mean square. Therefore, the
distribution of $\beta_k^{-1/2}\tilde\theta_k$ converges to the asymptotic distribution of the martingale
comprising the remaining terms. To complete the proof, we use the standard
central limit theorem for martingales [see Duflo (1997)]. The key assumption
of this theorem is Lindeberg's condition which, in our case, boils down to the
following: for each $\varepsilon > 0$,

$$\lim_k\sum_{i=0}^{k-1}E\big[|X_i^{(k)}|^2\,I\{|X_i^{(k)}| > \varepsilon\}\big] = 0,$$

where $I$ is the indicator function and for each $i < k$,

$$X_i^{(k)} = \sqrt{\beta_i}\,\Big(\frac{\beta_i}{\beta_k}\Big)^{1/2}\Theta_k^i\,(V_i + A_{12}A_{22}^{-1}W_i).$$
The verification of this assumption is quite standard. □ 

Remark. Similar results are possible for nonlinear iterations with Markov 
noise. For an informal sketch of such results, see Konda (2002). 

APPENDIX: AUXILIARY RESULTS 

A.1. Verification of (2.11). Without loss of generality, assume that $b_1 =
b_2 = 0$. Then, $\theta^* = 0$ and

$$\hat\theta_k = \tilde\theta_k = \theta_k,$$

and, using the definition of $\tilde r_k$ [cf. (2.4) and (2.10)], we have

$$\tilde r_k = L_k\theta_k + \hat r_k = L_k\theta_k + r_k + A_{22}^{-1}A_{21}\theta_k = r_k + M_k\theta_k, \tag{A.10}$$

where

$$M_k = L_k + A_{22}^{-1}A_{21}.$$

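To verify the equation for $\tilde\theta_{k+1} = \theta_{k+1}$, we use the recursion for $\theta_{k+1}$, to
obtain

$$\begin{aligned}
\tilde\theta_{k+1} &= \theta_k - \beta_k(A_{11}\theta_k + A_{12}r_k - V_k)\\
&= \theta_k - \beta_k\big(A_{11}\theta_k + A_{12}\tilde r_k - A_{12}(L_k + A_{22}^{-1}A_{21})\theta_k - V_k\big)\\
&= \theta_k - \beta_k\big(A_{11}\theta_k - A_{12}A_{22}^{-1}A_{21}\theta_k - A_{12}L_k\theta_k + A_{12}\tilde r_k - V_k\big)\\
&= \theta_k - \beta_k(\Delta\theta_k - A_{12}L_k\theta_k + A_{12}\tilde r_k) + \beta_kV_k\\
&= \theta_k - \beta_k(B_{11}^k\theta_k + A_{12}\tilde r_k) + \beta_kV_k,
\end{aligned}$$

where the last step makes use of the definition $B_{11}^k = \Delta - A_{12}L_k$.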
To verify the equation for $\tilde r_{k+1}$, we first use the definition (A.10) of $\tilde r_{k+1}$,
and then the update formulas for $\theta_{k+1}$ and $r_{k+1}$, to obtain

$$\begin{aligned}
\tilde r_{k+1} &= r_{k+1} + (A_{22}^{-1}A_{21} + L_{k+1})\theta_{k+1}\\
&= r_k - \gamma_k(A_{21}\theta_k + A_{22}r_k - W_k) + (A_{22}^{-1}A_{21} + L_{k+1})\theta_{k+1}\\
&= r_k - \gamma_k\big(A_{21}\theta_k + A_{22}(\tilde r_k - (L_k + A_{22}^{-1}A_{21})\theta_k) - W_k\big) + (A_{22}^{-1}A_{21} + L_{k+1})\theta_{k+1}\\
&= r_k - \gamma_k(A_{22}\tilde r_k - A_{22}L_k\theta_k - W_k) + M_{k+1}\theta_{k+1}\\
&= r_k + M_{k+1}\theta_k - \gamma_k(A_{22}\tilde r_k - A_{22}L_k\theta_k - W_k) - \beta_kM_{k+1}(B_{11}^k\theta_k + A_{12}\tilde r_k - V_k)\\
&= r_k + M_k\theta_k - \gamma_k\Big[\frac{L_k - L_{k+1}}{\gamma_k} - A_{22}L_k + \frac{\beta_k}{\gamma_k}M_{k+1}B_{11}^k\Big]\theta_k\\
&\qquad + \gamma_kW_k - \gamma_k\Big[A_{22} + \frac{\beta_k}{\gamma_k}M_{k+1}A_{12}\Big]\tilde r_k + \beta_kM_{k+1}V_k\\
&= \tilde r_k - \gamma_k(B_{21}^k\theta_k + B_{22}^k\tilde r_k) + \gamma_kW_k + \beta_kM_{k+1}V_k,
\end{aligned}$$

which is the desired formula.
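
The algebra above can be sanity-checked numerically. The sketch below runs the scalar case of the original recursion for $(\theta_k, r_k)$ and of the transformed recursion for $(\tilde\theta_k, \tilde r_k)$ side by side, and records the largest deviation from $\tilde\theta_k = \theta_k$ and $\tilde r_k = r_k + M_k\theta_k$. The coefficient values, the step sizes and the sequence `L(k)` are illustrative assumptions, not values from the paper; the identity is purely algebraic, so any sequence $L_k$ should work.

```python
import random


def check_change_of_variables(steps=200, seed=0):
    """Largest violation of r~_k = r_k + M_k * theta_k in a scalar example."""
    rng = random.Random(seed)
    a11, a12, a21, a22 = 2.0, 0.5, 0.3, 1.0      # assumed scalar coefficients
    delta = a11 - a12 * a21 / a22                # Delta = A11 - A12 A22^{-1} A21

    def beta(k):                                 # assumed slow step size
        return 1.0 / (k + 10)

    def gamma(k):                                # assumed fast step size
        return 1.0 / (k + 10) ** (2.0 / 3.0)

    def L(k):                                    # arbitrary test sequence L_k
        return 0.1 / (k + 1)

    def M(k):                                    # M_k = L_k + A22^{-1} A21
        return L(k) + a21 / a22

    theta, r = 1.0, -1.0
    theta_t, r_t = theta, r + M(0) * theta       # transformed variables at k = 0
    worst = 0.0
    for k in range(steps):
        b, g = beta(k), gamma(k)
        v, w = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        b11 = delta - a12 * L(k)
        b21 = (L(k) - L(k + 1)) / g - a22 * L(k) + (b / g) * M(k + 1) * b11
        b22 = a22 + (b / g) * M(k + 1) * a12
        # Transformed recursion (with b1 = b2 = 0).
        new_theta_t = theta_t - b * (b11 * theta_t + a12 * r_t) + b * v
        new_r_t = r_t - g * (b21 * theta_t + b22 * r_t) + g * w + b * M(k + 1) * v
        # Original recursion (with b1 = b2 = 0).
        new_theta = theta - b * (a11 * theta + a12 * r - v)
        new_r = r - g * (a21 * theta + a22 * r - w)
        theta, r, theta_t, r_t = new_theta, new_r, new_theta_t, new_r_t
        worst = max(worst,
                    abs(theta_t - theta),
                    abs(r_t - (r + M(k + 1) * theta)))
    return worst
```

The returned value should be at the level of floating-point round-off, which is what one expects if the two recursions describe the same trajectory.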

A.2. Convergence of the recursion (2.12). 

Lemma A.1. For $k_0$ sufficiently large, the (deterministic) sequence of
matrices $\{L_k\}$ defined by (2.12) is well defined and converges to zero.

Proof. The recursion (2.12) can be rewritten, for $k \ge k_0$, as

$$L_{k+1} = (I - \gamma_kA_{22})L_k + \beta_k\big(A_{22}^{-1}A_{21}B_{11}^k + (I - \gamma_kA_{22})L_kB_{11}^k\big)(I - \beta_kB_{11}^k)^{-1}, \tag{A.2}$$

which is of the form

$$L_{k+1} = (I - \gamma_kA_{22})L_k + \beta_kD_k(L_k),$$

for a sequence of matrix-valued functions $D_k(L_k)$ defined in the obvious
manner. Since $-A_{22}$ is Hurwitz, there exists a quadratic norm

$$|x|_Q = \sqrt{x'Qx},$$

a corresponding induced matrix norm, and a constant $a > 0$ such that

$$|I - \gamma A_{22}|_Q \le 1 - a\gamma$$

for every sufficiently small $\gamma$. It follows that

$$|(I - \gamma A_{22})L|_Q \le (1 - a\gamma)|L|_Q$$

for all matrices $L$ of appropriate dimensions and for $\gamma$ sufficiently small.
Therefore, for sufficiently large $k$, we have

$$|L_{k+1}|_Q \le (1 - \gamma_ka)|L_k|_Q + \beta_k|D_k(L_k)|_Q.$$

For $k_0$ sufficiently large, the sequence of functions $\{D_k(\cdot)\}_{k\ge k_0}$ is well defined
and uniformly bounded on the unit $Q$-ball $\{L : |L|_Q \le 1\}$. To see this, note
that as long as $|L_k|_Q \le 1$, we have $|B_{11}^k| = |\Delta - A_{12}L_k| \le c$, for some absolute
constant $c$. With $\beta_k$ small enough, the matrix $I - \beta_kB_{11}^k$ is invertible, and
satisfies $|(I - \beta_kB_{11}^k)^{-1}| \le 2$. With $|B_{11}^k|$ bounded by $c$, we have

$$|A_{22}^{-1}A_{21}B_{11}^k + (I - \gamma_kA_{22})L_kB_{11}^k| \le d(1 + |L_k|),$$

for some absolute constant $d$. To summarize, for large $k$, if $|L_k|_Q \le 1$, we
have $|D_k(L_k)| \le 4d$. Since any two norms on a finite-dimensional vector
space are equivalent, we have

$$|L_{k+1}|_Q \le (1 - \gamma_ka)|L_k|_Q + (\gamma_ka)\Big(\frac{d_1\beta_k}{a\gamma_k}\Big)$$

for some constant $d_1 > 0$. Recall now that the sequence is initialized with
$L_{k_0} = 0$. If $k_0$ is large enough so that $d_1\beta_k/(a\gamma_k) \le 1$, then $|L_k|_Q \le 1$ for all $k$.
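Furthermore, since $1 - x \le e^{-x}$, we have

$$|L_k|_Q \le d_1\sum_{j=k_0}^{k-1}\beta_j\exp\Big(-a\sum_{i=j+1}^{k-1}\gamma_i\Big).$$

The rest follows from Lemma A.3 as $\beta_k/\gamma_k \to 0$.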
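
To illustrate the lemma, here is a minimal sketch that iterates the scalar version of recursion (A.2) starting from $L_{k_0} = 0$. The coefficients and the step sizes $\beta_k = 1/(k+1)$, $\gamma_k = (k+1)^{-2/3}$ are assumptions chosen only so that $\beta_k/\gamma_k \to 0$; they are not taken from the paper.

```python
def simulate_L(k0=10, steps=20000):
    """Iterate the scalar form of (A.2) and return the list of |L_k| values."""
    a11, a12, a21, a22 = 2.0, 0.5, 0.3, 1.0      # assumed scalar coefficients
    delta = a11 - a12 * a21 / a22                # Delta = A11 - A12 A22^{-1} A21
    L = 0.0                                      # initialization L_{k0} = 0
    history = []
    for k in range(k0, k0 + steps):
        beta = 1.0 / (k + 1)                     # assumed slow step size
        gamma = (k + 1) ** (-2.0 / 3.0)          # assumed fast step size
        b11 = delta - a12 * L                    # B_11^k = Delta - A12 L_k
        contraction = (1.0 - gamma * a22) * L    # (I - gamma_k A22) L_k
        d_k = ((a21 / a22) * b11 + contraction * b11) / (1.0 - beta * b11)
        L = contraction + beta * d_k             # L_{k+1}
        history.append(abs(L))
    return history
```

In this example the iterates stay inside the unit ball and shrink slowly, roughly in proportion to $\beta_k/\gamma_k$, which is consistent with the bound used in the proof.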

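A.3. Linear matrix iterations. Consider a linear matrix iteration of the
form

$$\Sigma_{k+1} = \Sigma_k + \beta_k\big(\Gamma - A\Sigma_k - \Sigma_kB + \delta_k(\Sigma_k)\big),$$

for some square matrices $A$, $B$, step-size sequence $\beta_k$ and sequence of matrix-
valued affine functions $\delta_k(\cdot)$. Assume: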

1. The real parts of the eigenvalues of A are positive and the real parts of 
the eigenvalues of B are nonnegative. (The roles of A and B can also be 
interchanged.) 

2. $\beta_k$ is positive and

$$\lim_k\beta_k = 0, \qquad \sum_k\beta_k = \infty.$$

3. $\lim_k\delta_k(\cdot) = 0$.

We then have the following standard result whose proof can be found, for 
example, in Polyak (1976). 

Lemma A.2. For any $\Sigma_0$, $\lim_k\Sigma_k = \Sigma^*$ exists and is the unique solution
to the equation

$$A\Sigma + \Sigma B = \Gamma.$$
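
A small numerical illustration of Lemma A.2 follows. It assumes the iteration has the form $\Sigma_{k+1} = \Sigma_k + \beta_k(\Gamma - A\Sigma_k - \Sigma_kB + \delta_k)$ with a vanishing constant perturbation $\delta_k$; the $2\times 2$ matrices, the step size and the perturbation are all hypothetical choices made for the sketch. Instead of solving $A\Sigma + \Sigma B = \Gamma$ directly, the code reports the residual of that equation at the final iterate.

```python
def matmul2(x, y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(x[i][l] * y[l][j] for l in range(2)) for j in range(2)]
            for i in range(2)]


def run_matrix_iteration(steps=20000):
    """Run the assumed iteration and return (Sigma, max |A Sigma + Sigma B - Gamma|)."""
    a = [[2.0, 1.0], [0.0, 3.0]]       # eigenvalues 2 and 3: positive real parts
    b = [[1.0, 0.0], [0.5, 0.0]]       # eigenvalues 1 and 0: nonnegative real parts
    gam = [[1.0, 2.0], [3.0, 4.0]]     # assumed right-hand side Gamma
    sig = [[0.0, 0.0], [0.0, 0.0]]     # arbitrary starting point Sigma_0
    for k in range(steps):
        beta = 1.0 / (k + 5)           # positive, tends to zero, sums to infinity
        pert = 1.0 / (k + 1)           # constant (hence affine) perturbation, -> 0
        a_sig = matmul2(a, sig)
        sig_b = matmul2(sig, b)
        sig = [[sig[i][j] + beta * (gam[i][j] - a_sig[i][j] - sig_b[i][j] + pert)
                for j in range(2)] for i in range(2)]
    a_sig = matmul2(a, sig)
    sig_b = matmul2(sig, b)
    residual = max(abs(a_sig[i][j] + sig_b[i][j] - gam[i][j])
                   for i in range(2) for j in range(2))
    return sig, residual
```

With these choices the residual decays roughly like the perturbation itself, so it is small but not at round-off level after a finite number of steps.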
A.4. Convergence of some series. We provide here some lemmas that are
used in the proof of asymptotic normality. Throughout this section, $\{\gamma_k\}$ is
a positive sequence such that:

1. $\gamma_k \to 0$, and

2. $\sum_k\gamma_k = \infty$.

Furthermore, $\{t_k\}$ is the sequence defined by

$$t_0 = 0, \qquad t_k = \sum_{j=0}^{k-1}\gamma_j, \quad k > 0.$$



Lemma A.3. For any nonnegative sequence $\{\delta_k\}$ that converges to zero
and any $p \ge 0$, we have

$$\lim_k\sum_{j=0}^{k}\gamma_j\Big(\sum_{i=j}^{k-1}\gamma_i\Big)^p\exp\Big(-\sum_{i=j}^{k-1}\gamma_i\Big)\delta_j = 0. \tag{A.3}$$



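Proof. Let $\delta(\cdot)$ be a nonnegative function on $[0, \infty)$ defined by

$$\delta(t) = \delta_k, \qquad t_k \le t < t_{k+1}.$$

Then it is easy to see that for any $k_0 > 0$,

$$\sum_{j=k_0}^{k}\gamma_j\Big(\sum_{i=j}^{k-1}\gamma_i\Big)^p\exp\Big(-\sum_{i=j}^{k-1}\gamma_i\Big)\delta_j = \int_{t_{k_0}}^{t_k}(t_k - s)^pe^{-(t_k - s)}\delta(s)\,ds + e_k^{k_0},$$

where

$$|e_k^{k_0}| \le c\,\sup_{j\ge k_0}\gamma_j\,\sum_{j=k_0}^{k}\gamma_j\Big(\sum_{i=j}^{k-1}\gamma_i\Big)^p\exp\Big(-\sum_{i=j}^{k-1}\gamma_i\Big)\delta_j$$

for some constant $c > 0$. Therefore, for $k_0$ sufficiently large, we have

$$\lim_k\sum_{j=k_0}^{k}\gamma_j\Big(\sum_{i=j}^{k-1}\gamma_i\Big)^p\exp\Big(-\sum_{i=j}^{k-1}\gamma_i\Big)\delta_j \le \frac{\lim_t\int_0^t\delta(s)(t - s)^pe^{-(t - s)}\,ds}{1 - c\sup_{k\ge k_0}\gamma_k}.$$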
To calculate the above limit, note that

$$\begin{aligned}
\lim_t\int_0^t(t - s)^pe^{-(t - s)}\delta(s)\,ds &= \lim_t\int_0^ts^pe^{-s}\delta(t - s)\,ds\\
&\le \lim_t\Big(\sup_{s\ge t-T}|\delta(s)|\Big)\int_0^Ts^pe^{-s}\,ds + \sup_s|\delta(s)|\int_T^\infty s^pe^{-s}\,ds\\
&= \sup_s|\delta(s)|\int_T^\infty s^pe^{-s}\,ds.
\end{aligned}$$

Since $T$ is arbitrary, the above limit is zero. Finally, note that the limit
in (A.3) does not depend on the starting limit of the summation. □
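
The statement of Lemma A.3 can be explored numerically. The sketch below evaluates the sum in (A.3) for the assumed sequences $\gamma_j = (j+1)^{-2/3}$ and $\delta_j = (j+1)^{-1/3}$ with $p = 1$; these sequences are illustrative choices, not ones prescribed by the paper. The term $j = k$ is skipped because the inner sum is empty there and the term vanishes for $p > 0$.

```python
import math


def lemma_a3_sum(k, p=1.0):
    """Evaluate sum_j gamma_j * (tail_j)**p * exp(-tail_j) * delta_j for j < k,

    where tail_j = gamma_j + ... + gamma_{k-1}.
    """
    total = 0.0
    tail = 0.0
    # Walk j downwards so that the tail sum can be accumulated in one pass.
    for j in range(k - 1, -1, -1):
        gamma_j = (j + 1) ** (-2.0 / 3.0)        # assumed step sizes
        delta_j = (j + 1) ** (-1.0 / 3.0)        # assumed sequence tending to zero
        tail += gamma_j
        total += gamma_j * tail ** p * math.exp(-tail) * delta_j
    return total
```

Evaluating `lemma_a3_sum` at increasing values of `k` (for instance 100, 1000, 10000) should give a decreasing sequence, in line with the limit being zero; the decay is only as fast as that of $\delta_j$.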

Lemma A.4. For each $p \ge 0$, there exists $K_p > 0$ such that for any $k \ge
j \ge 0$,

$$\sum_{i=j}^{k}\gamma_i\Big(\sum_{l=j}^{i-1}\gamma_l\Big)^p\exp\Big(-\sum_{l=j}^{i-1}\gamma_l\Big) \le K_p.$$



Proof. For all $j$ sufficiently large, we have

$$\sum_{i=j}^{k}\gamma_i\Big(\sum_{l=j}^{i-1}\gamma_l\Big)^p\exp\Big(-\sum_{l=j}^{i-1}\gamma_l\Big) \le \frac{\int_0^\infty\tau^pe^{-\tau}\,d\tau}{1 - c\sup_{l\ge j}\gamma_l}$$

for some $c > 0$. □
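
As a rough check of the uniform bound in Lemma A.4, the following sketch evaluates the sum for several starting indices $j$, again with the assumed step sizes $\gamma_l = (l+1)^{-2/3}$ and $p = 1$. The sum is a left-endpoint Riemann sum of $\tau^pe^{-\tau}$, so for $p = 1$ one expects values near $\int_0^\infty\tau e^{-\tau}\,d\tau = 1$, with a larger discretization error when the early step sizes are coarse.

```python
import math


def lemma_a4_sum(j, k, p=1.0):
    """Evaluate sum_{i=j}^{k} gamma_i * s_i**p * exp(-s_i),

    where s_i = gamma_j + ... + gamma_{i-1} (an empty sum, i.e. 0, for i = j).
    """
    total = 0.0
    partial = 0.0                                # running value of s_i
    for i in range(j, k + 1):
        gamma_i = (i + 1) ** (-2.0 / 3.0)        # assumed step sizes
        total += gamma_i * partial ** p * math.exp(-partial)
        partial += gamma_i
    return total
```

For example, `[lemma_a4_sum(j, 5000) for j in (0, 10, 100)]` should return values that all stay within a fixed constant, moving closer to 1 as `j` grows and the step sizes become finer.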